Lab. 2 ARMA models

Preliminaries

Load libraries

If you get an error, go to the Packages panel, click Install, and type the name of the library (you only need to do this once).

library(fpp2)
library(tidyverse)
library(readxl)
library(lmtest) #contains coeftest function
library(TSstudio)

To install the next one you need to:

  • Download the file MLTools_0.0.31.tar.gz from Moodle (R libraries section).
  • Go to the packages panel,
  • Click Install
  • Under Install from select Package Archive File
  • Navigate to the folder where you downloaded the package and select it.
  • Click Install and wait for the setup to finish.
library(MLTools) 

Set the working directory

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

Reading time series data and EDA

Load dataset

fdata <- read_excel("ARMA_series.xls")

Visualize the first rows:

head(fdata)
## # A tibble: 6 × 6
##        y1     y2      y3    y4      y5     y6
##     <dbl>  <dbl>   <dbl> <dbl>   <dbl>  <dbl>
## 1  0       0      0       0     0       0    
## 2  0.153  -0.717 -1.71    0     1.67   -0.492
## 3 -0.407  -0.389  0.0653  1.45 -0.257   1.60 
## 4  0.0383  0.393  0.509   1.85 -0.0995 -0.525
## 5 -0.0557 -1.40  -0.237   4.16 -1.09    0.320
## 6  0.0899  0.809 -0.630   2.95  3.49   -0.622

Convert it to a (multi)time series object

In this case:

  • We assume that there are no time gaps
  • There is no temporal information in the data, so we use the natural numbers as index. This is the default in ts if no additional arguments are used.
fdata_ts <- ts(fdata)

and we get some basic information

ts_info(fdata_ts)
##  The fdata_ts series is a mts object with 6 variables and 250 observations
##  Frequency: 1 
##  Start time: 1 1 
##  End time: 250 1

We use the column index to select a time series

y <- fdata_ts[ ,2]

Identification and fitting process

ACF and PACF plots of the time series

We use them to identify significant lags and propose a tentative model order

ggtsdisplay(y, lag.max = 20)

ARMA model selection and fit

Now we propose a structure of the model by selecting the values of p and q

p <- 1
q <- 1

We fit the model with estimated order

arima.fit <- Arima(y, order=c(p, 0, q), include.mean = FALSE)

And we use summary to display the training errors and estimated coefficients.

summary(arima.fit) 
## Series: y 
## ARIMA(1,0,1) with zero mean 
## 
## Coefficients:
##           ar1      ma1
##       -0.0231  -0.7444
## s.e.   0.0891   0.0643
## 
## sigma^2 = 0.8641:  log likelihood = -335.89
## AIC=677.79   AICc=677.88   BIC=688.35
## 
## Training set error measures:
##                       ME      RMSE       MAE     MPE     MAPE      MASE
## Training set -0.03653454 0.9258414 0.7483762 297.631 350.5146 0.4578694
##                      ACF1
## Training set -0.003444551
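If the first guess for \(p\) and \(q\) turns out unsatisfactory, one option is to fit a small grid of candidate orders and compare them by AICc. This is just a sketch (the grid bounds 0:2 are an arbitrary choice, not part of the lab):

```r
# Sketch: fit a small grid of candidate ARMA(p, q) orders and rank by AICc.
orders <- expand.grid(p = 0:2, q = 0:2)
orders$AICc <- apply(orders, 1, function(o) {
  Arima(y, order = c(o["p"], 0, o["q"]), include.mean = FALSE)$aicc
})
orders[order(orders$AICc), ]   # best (lowest AICc) candidates first
```

Remember that information criteria only rank candidate models; the residual checks below are still needed.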

Model diagnosis

The next function explores the statistical significance of the estimated coefficients

coeftest(arima.fit) 
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1 -0.023139   0.089100  -0.2597   0.7951    
## ma1 -0.744401   0.064328 -11.5720   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We use the information to update the ARMA structure if needed. Calling autoplot on the fitted model produces a root plot to check stationarity and invertibility. Keep in mind that these plots show the inverse roots (that is, \(1/root\)), so every point should lie inside the unit circle.

autoplot(arima.fit) 
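The same check can be done numerically. The sketch below recomputes the inverse roots with polyroot, assuming R's sign conventions (AR polynomial \(1 - \phi_1 z\), MA polynomial \(1 + \theta_1 z\)):

```r
# Sketch: recompute the inverse roots displayed by autoplot.
phi   <- coef(arima.fit)["ar1"]
theta <- coef(arima.fit)["ma1"]
Mod(1 / polyroot(c(1, -phi)))    # inverse AR roots: all must be < 1
Mod(1 / polyroot(c(1, theta)))   # inverse MA roots: all must be < 1
```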

Check of the residuals

The residual plots can be used to check the hypotheses of the ARMA model:

  • The time plot of the residuals should look like white noise (no patterns)
  • The histogram and density curve help assess the normality.
  • The ACF and PACF should show no significant lags (at the 5% level), particularly in the first lags.
CheckResiduals.ICAI(arima.fit, bins = 100, lag=20)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1) with zero mean
## Q* = 24.314, df = 18, p-value = 0.145
## 
## Model df: 2.   Total lags used: 20

Remember: you should also check the p-value of the Ljung-Box test in the output of CheckResiduals.ICAI.
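For reference, the same test can be run directly with stats::Box.test, passing fitdf = 2 to account for the two estimated coefficients (a sketch; CheckResiduals.ICAI already reports this test):

```r
# Sketch: Ljung-Box test on the residuals, matching the output above.
# fitdf = 2 removes the degrees of freedom used by the ar1 and ma1 estimates.
Box.test(residuals(arima.fit), lag = 20, type = "Ljung-Box", fitdf = 2)
```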

An alternative residual plot is obtained with

ggtsdisplay(residuals(arima.fit), lag.max = 20)

Keep in mind that if the residuals are not white noise, you should change the order of the ARMA model.
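If several manual attempts still leave structure in the residuals, auto.arima from the forecast package (already loaded through fpp2) can suggest an order automatically. A sketch, restricted here to zero-mean ARMA models to match our manual fit:

```r
# Sketch: automatic order selection; d = 0 forbids differencing and
# allowmean = FALSE matches include.mean = FALSE in our manual fit.
auto.fit <- auto.arima(y, d = 0, allowmean = FALSE)
summary(auto.fit)
```

The suggested model should still pass the residual checks above before you accept it.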

Fitted time series and forecasting with the fitted model

We begin by reconstructing the series with the model. That is, we obtain the fitted value \(\hat{y}_{t|t-1}\) for all \(t\) in the training data.

The following plot compares the original time series with the reconstructed (fitted) values:

autoplot(y, series = "Real") +
  forecast::autolayer(arima.fit$fitted, series = "Fitted")

Now, for a true forecast of future values we use the horizon parameter h:

y_est <- forecast::forecast(arima.fit, h = 5)
autoplot(y_est)

ARMA time series simulation

The series we have been using are synthetic time series, obtained as in the example below:

set.seed(2024)
sim_ts <- arima.sim(n = 250, 
                 list(ar = c(0.8897, -0.4858), ma = c(-0.2279, 0.2488)),
                 sd = sqrt(0.1796))

See (Hyndman and Athanasopoulos 2021), sections 9.3 and 9.4 for the conditions that the model coefficients should verify.
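Those conditions can be checked numerically with polyroot: both the AR and MA polynomials should have all their roots outside the unit circle (arima.sim itself stops with an error if the AR part is non-stationary). A sketch for the coefficients above:

```r
# Sketch: stationarity / invertibility check for the simulation coefficients.
# AR polynomial: 1 - 0.8897 z + 0.4858 z^2
# MA polynomial: 1 - 0.2279 z + 0.2488 z^2
all(Mod(polyroot(c(1, -0.8897, 0.4858))) > 1)   # TRUE: stationary
all(Mod(polyroot(c(1, -0.2279, 0.2488))) > 1)   # TRUE: invertible
```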

Additional Code

Arma model selection with train/test split and performance measures

Let us go over the same dataset but this time we will use functions from other libraries to perform a basic train/test split and get some performance measures.

n <- length(y)
n_test <- floor(0.1 * n)

library(TSstudio)
y_split <- ts_split(y, sample.out = n_test)

y_TS <- y_split$test
y_TR <- y_split$train
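As an aside, the same split can be obtained with base R's window function (a sketch; recall that this series uses the natural numbers as time index):

```r
# Sketch: equivalent train/test split using stats::window.
y_TR_alt <- window(y, end = n - n_test)        # observations 1 .. n - n_test
y_TS_alt <- window(y, start = n - n_test + 1)  # last n_test observations
```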

Now we will briefly go over the same model identification and fitting steps, but using the training values.

ACF and PACF plots

ggtsdisplay(y_TR, lag.max = 20)

ARMA model selection and fit

p <- 1
q <- 1

Model fit

arima.fit <- Arima(y_TR, order=c(p, 0, q), include.mean = FALSE)

Estimated coefficients.

summary(arima.fit) 
## Series: y_TR 
## ARIMA(1,0,1) with zero mean 
## 
## Coefficients:
##           ar1      ma1
##       -0.0267  -0.7460
## s.e.   0.0912   0.0635
## 
## sigma^2 = 0.885:  log likelihood = -304.94
## AIC=615.87   AICc=615.98   BIC=626.12
## 
## Training set error measures:
##                       ME    RMSE       MAE      MPE     MAPE     MASE
## Training set -0.03280064 0.93654 0.7566988 325.6628 382.3785 0.461487
##                     ACF1
## Training set -0.00351327

Model diagnosis

coeftest(arima.fit) 
## 
## z test of coefficients:
## 
##      Estimate Std. Error  z value Pr(>|z|)    
## ar1 -0.026683   0.091205  -0.2926   0.7699    
## ma1 -0.746021   0.063459 -11.7560   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Root plot

autoplot(arima.fit) 

Check residuals

CheckResiduals.ICAI(arima.fit, bins = 100, lag=20)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,1) with zero mean
## Q* = 19.891, df = 18, p-value = 0.339
## 
## Model df: 2.   Total lags used: 20

Model fitted values, forecast predictions and performance evaluation

Once we are satisfied with the model we can use it to get a true forecast covering the same time span as the test set:

y_fc <- forecast::forecast(arima.fit, h = n_test)

Keep in mind that this forecast object is more than just a set of values: It is a list containing a lot of information about the model we fitted and the forecast for the selected horizon, as suggested by this plot:

autoplot(y_fc)

But if we want the forecasted values we can get them like this:

y_fc$mean
## Time Series:
## Start = 226 
## End = 250 
## Frequency = 1 
##  [1]  9.200863e-01 -2.455056e-02  6.550796e-04 -1.747941e-05  4.664012e-07
##  [6] -1.244493e-08  3.320666e-10 -8.860494e-12  2.364235e-13 -6.308462e-15
## [11]  1.683279e-16 -4.491475e-18  1.198455e-19 -3.197824e-21  8.532716e-23
## [16] -2.276775e-24  6.075092e-26 -1.621010e-27  4.325321e-29 -1.154120e-30
## [21]  3.079526e-32 -8.217064e-34  2.192550e-35 -5.850355e-37  1.561043e-38
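Other components of the forecast object can be inspected in the same way; for instance (the exact set of names can vary slightly between forecast versions):

```r
# Sketch: other useful components of the forecast object.
names(y_fc)        # e.g. "method", "mean", "lower", "upper", "fitted", ...
y_fc$lower[1:3, ]  # lower bounds of the 80% and 95% prediction intervals
y_fc$upper[1:3, ]  # upper bounds of the same intervals
```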

Note that, as indicated by the plot, these values decay to 0 very quickly. We can also use test_forecast from TSstudio to get a plot of the actual values, the model’s fitted values (in-sample, on the training set), and these forecasted values (out-of-sample, on the test set).

test_forecast(actual = y,
              forecast.obj = y_fc,
              test = y_TS)

We can get the model’s performance measures as follows:

forecast::accuracy(y_fc, y)
##                       ME    RMSE       MAE       MPE      MAPE      MASE
## Training set -0.03280064 0.93654 0.7566988 325.66275 382.37846 0.4614870
## Test set     -0.03431537 1.01252 0.8809293  93.91238  97.21708 0.5372513
##                     ACF1 Theil's U
## Training set -0.00351327        NA
## Test set     -0.48783927 0.8948283

You can check that, e.g., the RMSE obtained here for the test set agrees with the direct computation:

sqrt(mean((y_fc$mean - y_TS)^2))
## [1] 1.01252

Generating a time series using the ARMA process

Recall the equation of the ARMA(p, q) process:

\[(1 - \phi_1 B - \phi_2 B^2)y_t = (1 - \theta_1 B - \theta_2 B^2)\epsilon_t\]

As a follow-up to the introduction to stochastic processes in the previous session, we will see how to generate a time series that is a realization of one such process. We use two vectors of coefficients for the AR and MA parts (with \(p\leq 2, q\leq 2\)). See the code lines below and make sure you choose the signs correctly.

ar_cff <- c(0, 0)
ma_cff <- c(1/3, 0)
sd <- sqrt(0.1796)

We will create a Gaussian white noise time series. To do that we draw a sample of \(n\) random values from a normal distribution \(N(0, \sigma^2)\), with \(\sigma\) as defined above.

set.seed(2024)

n <- 250

w <- ts(rnorm(n, mean = 0, sd = sd) )

Now we generate a time series \(y_t\) using the ARMA(p, q) process. In order to start the process we assume that the values of \(y_t, w_t\) for \(t < 1\) are 0.

y_sim <- numeric(n)

y_sim[1] <- w[1]
y_sim[2] <- w[2] + ar_cff[1] * y_sim[1] - ma_cff[1] * w[1]

for(i in 3:n){
  y_sim[i] <- ar_cff[1] * y_sim[i - 1] + ar_cff[2] * y_sim[i - 2] + 
    w[i] - ma_cff[1] * w[i - 1] - ma_cff[2] * w[i - 2]
}

y_sim <- ts(y_sim)

Our generated y_sim time series is not as good as the one we obtained previously with arima.sim but, as the following plot illustrates, it exhibits a very similar dynamic.

ggtsdisplay(y_sim, lag.max = 20)

Suggested exercise:

  • Can you identify and fit an ARMA model for y_sim?
  • Are the coefficients you find related to our choices in constructing y_sim?
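As a starting point for the exercise (without working out the full answer), you can fit a candidate model and inspect its coefficients; remember the sign conventions discussed above when comparing them with ar_cff and ma_cff:

```r
# Sketch: one candidate fit for y_sim. The order chosen here is just a
# guess; justifying it from the ACF/PACF plots is part of the exercise.
fit_sim <- Arima(y_sim, order = c(0, 0, 1), include.mean = FALSE)
coeftest(fit_sim)
```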

References

Hyndman, R. J., and G. Athanasopoulos. 2021. Forecasting: Principles and Practice. OTexts. https://otexts.com/fpp3/.